{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#Tutorial 快速入门\n",
    "##数据准备\n",
    "faiss可以处理固定维度d的向量集合，这样的集合这里用二维数组表示。\n",
    "一般来说，我们需要两个数组：  \n",
    "1.data。包含被索引的所有向量元素；  \n",
    "2.query。索引向量，我们需要根据索引向量的值返回xb中的最近邻元素。  \n",
    "为了对比不同索引方式的差别，在下面的例子中我们统一使用完全相同的数据，即维数d为512，data包含2000个向量，每个向量符合正态分布。  \n",
    "需要注意的是，faiss需要数组中的元素都是32位浮点数格式。 datatype = 'float32'。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAADjJJREFUeJzt3X2MZfVdx/H3RxaklMryMNkACw5JSSvWWshIaDC1dmukYoEYQpaoXcmaTbRWKka71j+I/WuJprWmpnVTareVUAilAQU1uNBUTbo6PJSnBdlQHhYXdipPrY3WpV//uIdmXIeZu/fcmbvz4/1KJvecc3/nnu83u/OZc3/3nntTVUiS2vVDky5AkrS8DHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS49ZMugCAk046qaanpyddhiStKnffffe3qmpqqXGHRdBPT08zOzs76TIkaVVJ8uQw45y6kaTGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxh0WV8ZKh6vprbdN7NhPbLtwYsdWWzyjl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktS4JYM+yeeS7E/y4LxtJyS5I8lj3e3x3fYk+bMke5Lcn+Sc5SxekrS0Yc7oPw9ccNC2rcDOqjoT2NmtA7wPOLP72QJ8ejxlSpJGtWTQV9XXgOcP2nwxsKNb3gFcMm/7F2rg68DaJCePq1hJ0qEbdY5+XVXt65afBdZ1y6cCT88bt7fbJkmakN4vxlZVAXWo+yXZkmQ2yezc3FzfMiRJr2HUoH/u1SmZ7nZ/t/0Z4LR549Z32/6fqtpeVTNVNTM1NTViGZKkpYwa9LcCm7rlTcAt87Z/oHv3zXnAS/OmeCRJE7DkN0wluR54N3BSkr3A1cA24MYkm4Engcu64bcDvwDsAb4LXLEMNUuSDsGSQV9Vl7/GXRsWGFvAB/sWJUkaH6+MlaTGGfSS1DiDXpIat+QcvaTJmN5620SO+8S2CydyXC0fz+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOrxLUqjCpr9WTWuAZvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjesV9El+J8lDSR5Mcn2So5OckWRXkj1Jbkhy1LiKlSQdupGDPsmpwG8DM1X1NuAIYCNwDfCJqnoz8AKweRyFSpJG03fqZg3whiRrgGOAfcB7gJu6+3cAl/Q8hiSph5GDvqqeAf4EeIpBwL8E3A28WFUHumF7gVMX2j/JliSzSWbn5uZGLUOStIQ+UzfHAxcDZwCnAG8ELhh2/6raXlUzVTUzNTU1ahmSpCX0mbp5L/DNqpqrqv8BbgbOB9Z2UzkA64FnetYoSeqhT9A/BZyX5JgkATYADwN3AZd2YzYBt/QrUZLUR585+l0MXnS9B3ige6ztwEeAq5LsAU4Erh1DnZKkEfX6PPqquhq4+qDNjwPn9nlcSdL4eGWsJDXOoJekxhn0ktQ4g16SGueXg+uQ+CXd0urjGb0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxvYI+ydokNyV5JMnuJO9MckKSO5I81t0eP65iJUmHru8Z/SeBv6uqtwI/CewGtgI7q+pMYGe3LkmakJGDPslxwLuAawGq6ntV9SJwMbCjG7YDuKRvkZKk0fU5oz8DmAP+Msm9ST6b5I3Auqra1415FljXt0hJ0uj6BP0a4Bzg01V1NvCfHDRNU1UF1EI7J9mSZDbJ7NzcXI8yJEmL6RP0e4G9VbWrW7+JQfA/l+RkgO52/0I7V9X2qpqpqpmpqakeZUiSFrNm1B2r6tkkTyd5S1U9CmwAHu5+NgHbuttbxlKppBUxvfW2iR37iW0XTuzYLRs56DsfAq5LchTwOHAFg2cJNybZDDwJXNbzGJKkHnoFfVXdB8wscNeGPo8rSRofr4yVpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWpc76BPckSSe5P8Tbd+RpJdSfYkuSHJUf3LlCSNahxn9FcCu+etXwN8oqreDLwAbB7DMSRJI+oV9EnWAxcCn+3WA7wHuKkbsgO4pM8xJEn99D2j/1Pg94Hvd+snAi9W1YFufS9was9jSJJ6WDPqjkl+EdhfVXcnefcI+28BtgCcfvrpo5bxujS99bZJlyBpFelzRn8+cFGSJ4AvMZiy+SSwNsmrf0DWA88stHNVba+qmaqamZqa6lGGJGkxIwd9Vf1BVa2vqmlgI3BnVf0ycBdwaTdsE3BL7yolSSNbjvfRfwS4KskeBnP21y7DMSRJQxp5jn6+qvoq8NVu+XHg3HE8riSpP6+MlaTGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0z6CWpcQa9JDXOoJekxhn0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIaZ9BLUuMMeklqnEEvSY0bOeiTnJbkriQPJ3koyZXd9hOS3JHkse72+PGVK0k6VH3O6A8Av1tVZwHnAR9MchawFdhZVWcCO7t1SdKEjBz0VbWvqu7plr8N7AZOBS4GdnTDdgCX9C1SkjS6sczRJ5kGzgZ2Aeuqal9317PAutfYZ0uS2SSzc3Nz4yhDkrSA3kGf5Fjgy8CHq+rl+fdVVQG10H5Vtb2qZqpqZmpqqm8ZkqTX0CvokxzJIOSvq6qbu83PJTm5u/9kYH+/EiVJffR5102Aa4HdVfXxeXfdCmzqljcBt4xeniSprzU99j0f+FXggST3dds+CmwDbkyyGXgSuKxfiZJeL6a33jaR4z6x7cKJHHeljBz0VfVPQF7j7g2jPq4kaby8MlaSGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1rs/76F/3JvWeX0k6FJ7RS1LjPKOX9Lo3yWfnK3FVrmf0ktQ4g16SGmfQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUOINekhpn0EtS4wx6SWqcQS9JjTPoJalxBr0kNc6gl6TGGfSS1DiDXpIat+q/StAv6JakxXlGL0mNW5agT3JBkkeT7EmydTmOIUkaztiDPskRwJ8D7wPOAi5Pcta4jyNJGs5ynNGfC+ypqser6nvAl4CLl+E4kqQhLEfQnwo8PW99b7dNkjQBE3vXTZItwJZu9TtJHp1390nAt1a+qomw1zbZa3uWpc9c02v3Hx1m0HIE/TPAafPW13fb/o+q2g5sX+gBksxW1cwy1HbYsdc22Wt7VnOfyzF186/AmUnOSHIUsBG4dRmOI0kawtjP6KvqQJLfAv4eOAL4XFU9NO7jSJKGsyxz9FV1O3B7j4dYcEqnUfbaJnttz6rtM1U16RokScvIj0CQpMZNLOiTnJbkriQPJ3koyZULjPm9JPd1Pw8meSXJCZOot48hez0uyV8n+UY35opJ1NrXkL0en+QrSe5P8i9J3jaJWvtKcnRX/6v/Zn+0wJgfTnJD93Egu5JMr3yl/QzZ57uS3JPkQJJLJ1HnOAzZ61Xd/+/7k+xMMtRbHCeqqibyA5wMnNMtvwn4N+CsRca/H7hzUvUud6/AR4FruuUp4HngqEnXvky9/jFwdbf8VmDnpOsesdcAx3bLRwK7gPMOGvObwGe65Y3ADZOue5n6nAbeDnwBuHTSNS9zrz8LHNMt/8Zq+Ded2Bl9Ve2rqnu65W8Du1n8CtrLgetXorZxG7LXAt6UJMCxDIL+wIoWOgZD9noWcGc35hFgOsm6FS10DGrgO93qkd3PwS96XQzs6JZvAjZ0/8arxjB9VtUTVXU/8P2Vrm+chuz1rqr6brf6dQbXCh3WDos5+u7p7NkM/noudP8xwAXAl1euquWxSK+fAn4M+HfgAeDKqlrVvzSL9PoN4Je6MecyuLrvsP9lWUiSI5LcB+wH7qiqg3v9wUeCVNUB4CXgxJWtsr8h+mzGIfa6GfjblalsdBMP+iTHMgjwD1fVy68x7P3AP1fV8ytX2fgt0evPA/cBpwDvAD6V5EdWuMSxWaLXbcDa7pfpQ8C9wCsrXOJYVNUrVfUOBn+ozl2trzcs5fXSJwzfa5JfAWYYTEUe1iYa9EmOZBAG11XVzYsM3cgqnbZ51RC9XgHc3D113AN8k8H89aqzVK9V9XJVXdH9Mn2AwWsSj69wmWNVVS8CdzF45jnfDz4SJMka4DjgP1a2uvFZpM/mLNZrkvcCfwhcVFX/vdK1HapJvusmwLXA7qr6+CLjjgN+BrhlpWobtyF7fQrY0I1fB7yFVRh+w/SaZG338RgAvw58bZFnc4etJFNJ1nbLbwB+DnjkoGG3Apu65UsZvKFgVV28MmSfTRim1yRnA3/BIOT3r3yVh25iF0wl+WngHxnMR786F/1R4HSAqvpMN+7XgAuqauMEyhyLYXpNcgrweQbvWgmwrar+auWr7WfIXt/J4AXKAh4CNlfVCxMot5ckb2fQxxEMTppurKqPJfkYMFtVtyY5Gvgig9cqngc2VtWq+gM+ZJ8/BXwFOB74L+DZqvrxiRU9oiF7/QfgJ4B93W5PVdVFk6l4OF4ZK0mNm/iLsZKk5WXQS1LjDHpJapxBL0mNM+glqXEGvSQ1zqCXpMYZ9JLUuP8F80WBUXsFyawAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np \n",
    "d = 512          #维数\n",
    "n_data = 2000   \n",
    "np.random.seed(0) \n",
    "data = []\n",
    "mu = 3\n",
    "sigma = 0.1\n",
    "for i in range(n_data):\n",
    "    data.append(np.random.normal(mu, sigma, d))\n",
    "data = np.array(data).astype('float32')\n",
    "\n",
    "# print(data[0])\n",
    "\n",
    "# 查看第六个向量是不是符合正态分布\n",
    "import matplotlib.pyplot as plt \n",
    "plt.hist(data[5])\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = []\n",
    "n_query = 10\n",
    "mu = 3\n",
    "sigma = 0.1\n",
    "np.random.seed(12) \n",
    "query = []\n",
    "for i in range(n_query):\n",
    "    query.append(np.random.normal(mu, sigma, d))\n",
    "query = np.array(query).astype('float32')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##精确索引\n",
    "在使用faiss时，我们是围绕index对象进行的。index中包含被索引的数据库向量，在索引时可以选择不同方式的预处理来提高索引的效率，表现维不同的索引类型。在精确搜索时选择最简单的IndexFlatL2索引类型。  \n",
    "IndexFlatL2类型遍历计算查询向量与被查询向量的L2精确距离，不需要训练操作（大部分index类型都需要train操作）。  \n",
    "在构建index时要提供相关参数，这里是向量维数d，构建完成index之后可以通过add()和search（）进行查询。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n",
      "2000\n"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "sys.path.append('/home/maliqi/faiss/python/')\n",
    "import faiss\n",
    "index = faiss.IndexFlatL2(d)  # 构建index\n",
    "print(index.is_trained)  # False时需要train\n",
    "index.add(data)  #添加数据\n",
    "print(index.ntotal)  #index中向量的个数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.       8.007045 8.313328 8.53525  8.560175 8.561642 8.624167 8.628234\n",
      "  8.709978 8.77004 ]\n",
      " [0.       8.27809  8.355579 8.42606  8.462017 8.468868 8.487028 8.549963\n",
      "  8.562824 8.599199]\n",
      " [0.       8.152368 8.156569 8.223303 8.276016 8.376871 8.379269 8.406122\n",
      "  8.418619 8.443283]\n",
      " [0.       8.260519 8.336826 8.339298 8.40288  8.46439  8.474661 8.479043\n",
      "  8.485248 8.526599]\n",
      " [0.       8.346273 8.407202 8.462828 8.49723  8.520801 8.597084 8.600386\n",
      "  8.605133 8.630594]]\n",
      "[[   0  798  879  223  981 1401 1458 1174  919   26]\n",
      " [   1  981 1524 1639 1949 1472 1162  923  840  300]\n",
      " [   2 1886  375 1351  518 1735 1551 1958  390 1695]\n",
      " [   3 1459  331  389  655 1943 1483 1723 1672 1859]\n",
      " [   4   13  715 1470  608  459  888  850 1080 1654]]\n"
     ]
    }
   ],
   "source": [
    "k = 10  # 返回结果个数\n",
    "query_self = data[:5]  # 查询本身\n",
    "dis, ind = index.search(query_self, k)\n",
    "print(dis)  # 升序返回每个查询向量的距离\n",
    "print(ind)  # 升序返回每个查询向量的k个相似结果"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为查询向量是数据库向量的子集，所以每个查询向量返回的结果中排序第一的是其本身，L2距离是0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[8.61838   8.782156  8.782816  8.832029  8.837633  8.848496  8.897978\n",
      "  8.916636  8.919006  8.9374   ]\n",
      " [9.033303  9.038907  9.091705  9.15584   9.164591  9.200112  9.201884\n",
      "  9.220335  9.279477  9.312859 ]\n",
      " [8.063818  8.211029  8.306456  8.373352  8.459253  8.459892  8.498557\n",
      "  8.546464  8.555408  8.621426 ]\n",
      " [8.193894  8.211956  8.34701   8.446963  8.45299   8.45486   8.473572\n",
      "  8.50477   8.513636  8.530684 ]\n",
      " [8.369624  8.549444  8.704066  8.736764  8.760082  8.777319  8.831345\n",
      "  8.835486  8.858271  8.860058 ]\n",
      " [8.299072  8.432398  8.434382  8.457374  8.539217  8.562359  8.579033\n",
      "  8.618736  8.630861  8.643393 ]\n",
      " [8.615004  8.615164  8.72604   8.730943  8.762621  8.796932  8.797068\n",
      "  8.797365  8.813985  8.834726 ]\n",
      " [8.377227  8.522776  8.711159  8.724562  8.745737  8.763846  8.768602\n",
      "  8.7727995 8.786856  8.828224 ]\n",
      " [8.342917  8.488056  8.655106  8.662771  8.701336  8.741287  8.743608\n",
      "  8.770507  8.786264  8.849051 ]\n",
      " [8.522164  8.575703  8.68462   8.767247  8.782909  8.850494  8.883733\n",
      "  8.90369   8.909393  8.91768  ]]\n",
      "[[1269 1525 1723 1160 1694   48 1075 1028  544  916]\n",
      " [1035  259 1279 1116 1398  879  289  882 1420 1927]\n",
      " [ 327  345 1401  389 1904 1992 1612  106  981 1179]\n",
      " [1259  112  351  804 1412 1987 1377  250 1624  133]\n",
      " [1666  854 1135  616   94  280   30   99 1212    3]\n",
      " [ 574 1523  366  766 1046   91  456  649   46  896]\n",
      " [1945  944  244  655 1686  981  256 1555 1280 1969]\n",
      " [ 879 1025  390  269 1115 1662 1831  610   11  191]\n",
      " [ 156  154   99   31 1237  289  769 1524   56  661]\n",
      " [ 427  182  375 1826  610 1384 1299  750    2 1430]]\n"
     ]
    }
   ],
   "source": [
    "k = 10\n",
    "dis, ind = index.search(query, k)\n",
    "print(dis)\n",
    "print(ind)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##倒排表快速索引\n",
    "在数据量非常大的时候，需要对数据做预处理来提高索引效率。一种方式是对数据库向量进行分割，划分为多个d维维诺空间，查询阶段，只需要将查询向量落入的维诺空间中的数据库向量与之比较，返回计算所得的k个最近邻结果即可，大大缩减了索引时间。  \n",
    "nlist参数控制将数据集向量分为多少个维诺空间；  \n",
    "nprobe参数控制在多少个维诺空间的范围内进行索引。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[8.61838   8.782156  8.782816  8.832029  8.837633  8.848496  8.897978\n",
      "  8.916636  8.919006  8.9374   ]\n",
      " [9.033303  9.038907  9.091705  9.15584   9.164591  9.200112  9.201884\n",
      "  9.220335  9.279477  9.312859 ]\n",
      " [8.063818  8.211029  8.306456  8.373352  8.459253  8.459892  8.498557\n",
      "  8.546464  8.555408  8.621426 ]\n",
      " [8.193894  8.211956  8.34701   8.446963  8.45299   8.45486   8.473572\n",
      "  8.50477   8.513636  8.530684 ]\n",
      " [8.369624  8.549444  8.704066  8.736764  8.760082  8.777319  8.831345\n",
      "  8.835486  8.858271  8.860058 ]\n",
      " [8.299072  8.432398  8.434382  8.457374  8.539217  8.562359  8.579033\n",
      "  8.618736  8.630861  8.643393 ]\n",
      " [8.615004  8.615164  8.72604   8.730943  8.762621  8.796932  8.797068\n",
      "  8.797365  8.813985  8.834726 ]\n",
      " [8.377227  8.522776  8.711159  8.724562  8.745737  8.763846  8.768602\n",
      "  8.7727995 8.786856  8.828224 ]\n",
      " [8.342917  8.488056  8.655106  8.662771  8.701336  8.741287  8.743608\n",
      "  8.770507  8.786264  8.849051 ]\n",
      " [8.522164  8.575703  8.68462   8.767247  8.782909  8.850494  8.883733\n",
      "  8.90369   8.909393  8.91768  ]]\n",
      "[[1269 1525 1723 1160 1694   48 1075 1028  544  916]\n",
      " [1035  259 1279 1116 1398  879  289  882 1420 1927]\n",
      " [ 327  345 1401  389 1904 1992 1612  106  981 1179]\n",
      " [1259  112  351  804 1412 1987 1377  250 1624  133]\n",
      " [1666  854 1135  616   94  280   30   99 1212    3]\n",
      " [ 574 1523  366  766 1046   91  456  649   46  896]\n",
      " [1945  944  244  655 1686  981  256 1555 1280 1969]\n",
      " [ 879 1025  390  269 1115 1662 1831  610   11  191]\n",
      " [ 156  154   99   31 1237  289  769 1524   56  661]\n",
      " [ 427  182  375 1826  610 1384 1299  750    2 1430]]\n"
     ]
    }
   ],
   "source": [
    "nlist = 50  # 将数据库向量分割为多少了维诺空间\n",
    "k = 10\n",
    "quantizer = faiss.IndexFlatL2(d)  # 量化器\n",
    "index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)\n",
    "       # METRIC_L2计算L2距离, 或faiss.METRIC_INNER_PRODUCT计算内积\n",
    "assert not index.is_trained   #倒排表索引类型需要训练\n",
    "index.train(data)  # 训练数据集应该与数据库数据集同分布\n",
    "assert index.is_trained\n",
    "\n",
    "index.add(data)\n",
    "index.nprobe = 50  # 选择n个维诺空间进行索引,\n",
    "dis, ind = index.search(query, k)\n",
    "print(dis)\n",
    "print(ind)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过改变nprobe的值，发现在nprobe值较小的时候，查询可能会出错，但时间开销很小，随着nprobe的值增加，精度逐渐增大，但时间开销也逐渐增加，当nprobe=nlist时，等效于IndexFlatL2索引类型。  \n",
    "简而言之，倒排表索引首先将数据库向量通过聚类方法分割成若干子类，每个子类用类中心表示，当查询向量来临，选择距离最近的类中心，然后在子类中应用精确查询方法，通过增加相邻的子类个数提高索引的精确度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##乘积量化索引\n",
    "在上述两种索引方式中，在index中都保存了完整的数据库向量，在数据量非常大的时候会占用太多内存，甚至超出内存限制。  \n",
    "在faiss中，当数据量非常大的时候，一般采用乘积量化方法保存原始向量的有损压缩形式,故而查询阶段返回的结果也是近似的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[4.8332453 4.916275  5.0142426 5.0211687 5.0282335 5.039744  5.063374\n",
      "  5.0652556 5.065288  5.0683947]\n",
      " [4.456933  4.6813188 4.698038  4.709836  4.72171   4.7280436 4.728564\n",
      "  4.728917  4.7406554 4.752378 ]\n",
      " [4.3990726 4.554667  4.622962  4.6567664 4.665245  4.700697  4.7056646\n",
      "  4.715714  4.7222314 4.7242   ]\n",
      " [4.4063187 4.659938  4.719548  4.7234855 4.727058  4.7630377 4.767138\n",
      "  4.770565  4.7718883 4.7720337]\n",
      " [4.5876865 4.702366  4.7323933 4.7387223 4.7550535 4.7652235 4.7820272\n",
      "  4.788397  4.792813  4.7930083]]\n",
      "[[   0 1036 1552  517 1686 1666    9 1798  451 1550]\n",
      " [   1  725  270 1964  430  511  598   20  583  728]\n",
      " [   2  761 1254  928 1913 1886  400  360 1850 1840]\n",
      " [   3 1035 1259 1884  584 1802 1337 1244 1472  468]\n",
      " [   4 1557  350  233 1545 1084 1979 1537  665 1432]]\n",
      "[[5.184828  5.1985765 5.2006407 5.202751  5.209732  5.2114754 5.2203827\n",
      "  5.22132   5.2252693 5.2286644]\n",
      " [5.478416  5.5195136 5.532296  5.563965  5.564443  5.5696826 5.586555\n",
      "  5.5897493 5.59312   5.5942397]\n",
      " [4.7446747 4.8150816 4.824335  4.834736  4.83847   4.844829  4.850663\n",
      "  4.853364  4.856619  4.865398 ]\n",
      " [4.733185  4.7483554 4.7688575 4.783175  4.785554  4.7890463 4.7939577\n",
      "  4.797909  4.8015175 4.802591 ]\n",
      " [5.1260395 5.1264906 5.134188  5.1386065 5.141901  5.148476  5.1756086\n",
      "  5.1886897 5.192538  5.1938267]\n",
      " [4.882325  4.900981  4.9040375 4.911916  4.916094  4.923492  4.928433\n",
      "  4.928472  4.937878  4.9518585]\n",
      " [4.9729834 4.976016  4.984484  5.0074816 5.015956  5.0174923 5.0200887\n",
      "  5.0217285 5.028976  5.029479 ]\n",
      " [5.064405  5.0903125 5.0971365 5.098599  5.108646  5.113497  5.1155915\n",
      "  5.1244674 5.1263866 5.129635 ]\n",
      " [5.060173  5.0623484 5.075763  5.087064  5.100909  5.1075807 5.109309\n",
      "  5.110051  5.1323767 5.1330123]\n",
      " [5.12455   5.149974  5.151128  5.163775  5.1637926 5.1726117 5.1732545\n",
      "  5.1762547 5.1780767 5.185327 ]]\n",
      "[[1264  666   99 1525 1962 1228  366  268  358 1509]\n",
      " [ 520  797 1973  365 1545 1032 1077   71  763  753]\n",
      " [1632  689 1315  321  459 1486  818 1094  378 1479]\n",
      " [ 721 1837  537 1741 1627  154 1557  880  539 1784]\n",
      " [1772  750 1166 1799  572  997  340  127  756  375]\n",
      " [1738 1978  724  749  816 1046 1402  444 1955  246]\n",
      " [1457 1488 1902 1187 1485  986   32  531   56  913]\n",
      " [1488 1244  121 1144 1280 1078 1012 1215 1639 1175]\n",
      " [ 426   45  122 1239  300 1290  546  505 1687  434]\n",
      " [ 263  343 1025  583 1489  356 1570 1282  627 1432]]\n"
     ]
    }
   ],
   "source": [
    "nlist = 50\n",
    "m = 8                             # 列方向划分个数，必须能被d整除\n",
    "k = 10\n",
    "quantizer = faiss.IndexFlatL2(d)  \n",
    "index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 4)\n",
    "                                    # 4 表示每个子向量被编码为 4 bits\n",
    "index.train(data)\n",
    "index.add(data)\n",
    "index.nprobe = 50\n",
    "dis, ind = index.search(query_self, k)  # 查询自身\n",
    "print(dis)\n",
    "print(ind)\n",
    "dis, ind = index.search(query, k)  # 真实查询\n",
    "print(dis)\n",
    "print(ind)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "实验发现，乘积量化后查询返回的距离值与真实值相比偏小，返回的结果只是近似值。  \n",
    "查询自身时能够返回自身，但真实查询时效果较差，这里只是使用了正态分布的数据集，在真实使用时效果会更好，原因有：  \n",
    "1.正态分布的数据相对更难查询，难以聚类/降维；  \n",
    "2.自然数据相似的向量与不相似的向量差别更大，更容易查找；\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
